
    Provable Deterministic Leverage Score Sampling

    We explain theoretically a curious empirical phenomenon: "Approximating a matrix by deterministically selecting a subset of its columns with the largest leverage scores results in a good low-rank matrix surrogate". To obtain provable guarantees, previous work requires randomized sampling of the columns with probabilities proportional to their leverage scores. In this work, we provide a novel theoretical analysis of deterministic leverage score sampling. We show that such deterministic sampling can be provably as accurate as its randomized counterparts, if the leverage scores follow a moderately steep power-law decay. We support this power-law assumption by providing empirical evidence that such decay laws are abundant in real-world data sets. We then demonstrate empirically the performance of deterministic leverage score sampling, which often matches or outperforms state-of-the-art techniques. Comment: 20th ACM SIGKDD Conference on Knowledge Discovery and Data Mining.
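    A minimal numpy sketch of the deterministic selection rule described above, under the standard convention that the rank-k leverage score of a column is its squared loading on the top-k right singular subspace; the function name and parameter choices are illustrative assumptions, not the paper's code.

```python
import numpy as np

def deterministic_leverage_sampling(A, k, c):
    """Keep the c columns of A with the largest rank-k leverage scores.

    The score of column j is the squared norm of the j-th column of
    V_k^T, i.e. how strongly that column loads on the top-k right
    singular subspace of A.
    """
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    scores = np.sum(Vt[:k, :] ** 2, axis=0)        # one leverage score per column
    top = np.sort(np.argsort(scores)[::-1][:c])    # indices of the c largest scores
    return A[:, top], top

# Toy usage: a noisy low-rank matrix.
rng = np.random.default_rng(0)
A = rng.standard_normal((100, 10)) @ rng.standard_normal((10, 50))
C, cols = deterministic_leverage_sampling(A, k=10, c=15)
```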

    Optimal CUR Matrix Decompositions

    The CUR decomposition of an $m \times n$ matrix $A$ finds an $m \times c$ matrix $C$ with a subset of $c < n$ columns of $A$, together with an $r \times n$ matrix $R$ with a subset of $r < m$ rows of $A$, as well as a $c \times r$ low-rank matrix $U$ such that the matrix $CUR$ approximates the matrix $A$, that is, $\|A - CUR\|_F^2 \le (1+\epsilon) \|A - A_k\|_F^2$, where $\|\cdot\|_F$ denotes the Frobenius norm and $A_k$ is the best $m \times n$ matrix of rank $k$ constructed via the SVD. We present input-sparsity-time and deterministic algorithms for constructing such a CUR decomposition where $c = O(k/\epsilon)$, $r = O(k/\epsilon)$, and $\mathrm{rank}(U) = k$. Up to constant factors, our algorithms are simultaneously optimal in $c$, $r$, and $\mathrm{rank}(U)$. Comment: small revision in lemma 4.
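    As a reference point for what a CUR factorization looks like, the numpy sketch below builds $C$, $R$, and the Frobenius-optimal middle factor $U = C^{+} A R^{+}$ for an already-chosen set of column and row indices. This is only an illustrative construction, not the paper's input-sparsity-time or deterministic algorithms, which also select the indices and control $\mathrm{rank}(U)$.

```python
import numpy as np

def cur_from_indices(A, col_idx, row_idx):
    """Build a CUR approximation from chosen column/row indices.

    U = pinv(C) @ A @ pinv(R) minimizes ||A - C U R||_F for the given
    C and R (illustrative only; it does not enforce rank(U) = k).
    """
    C = A[:, col_idx]                                  # m x c
    R = A[row_idx, :]                                  # r x n
    U = np.linalg.pinv(C) @ A @ np.linalg.pinv(R)      # c x r
    return C, U, R

rng = np.random.default_rng(1)
A = rng.standard_normal((60, 8)) @ rng.standard_normal((8, 40))
C, U, R = cur_from_indices(A, col_idx=np.arange(12), row_idx=np.arange(12))
rel_err = np.linalg.norm(A - C @ U @ R, 'fro') / np.linalg.norm(A, 'fro')
```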

    Block CUR: Decomposing Matrices using Groups of Columns

    A common problem in large-scale data analysis is to approximate a matrix using a combination of specifically sampled rows and columns, known as CUR decomposition. Unfortunately, in many real-world environments, the ability to sample specific individual rows or columns of the matrix is limited by either system constraints or cost. In this paper, we consider matrix approximation by sampling predefined blocks of columns (or rows) from the matrix. We present an algorithm for sampling useful column blocks and provide novel guarantees for the quality of the approximation. The algorithm has applications in problems as diverse as biometric data analysis and distributed computing. We demonstrate the effectiveness of the proposed algorithms for computing the Block CUR decomposition of large matrices in a distributed setting with multiple nodes in a compute cluster. There, blocks correspond to columns (or rows) of the matrix stored on the same node, which can be retrieved with much less overhead than individual columns stored across different nodes. In the biometric setting, rows correspond to different users and columns to the users' biometric reactions to external stimuli, e.g., watching video content, at a particular time instant. There is significant cost in acquiring each user's reaction to lengthy content, so we sample a few important scenes to approximate the biometric response. An individual time sample in this use case cannot be queried in isolation because of the lack of context that caused the biometric reaction; instead, collections of time segments (i.e., blocks) must be presented to the user. The practical applicability of these algorithms is shown via experimental results using real-world user biometric data from a content testing environment. Comment: shorter version to appear in ECML-PKDD 201
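    A small numpy sketch of block column sampling in this spirit, where whole blocks are scored and drawn together. Scoring a block by the sum of its columns' rank-k leverage scores, and all parameter names, are assumptions for illustration; the paper defines its own block sampling scores and guarantees.

```python
import numpy as np

def sample_column_blocks(A, block_size, n_blocks, k, rng):
    """Draw n_blocks column blocks, each block kept or discarded as a unit.

    Blocks are sampled with probability proportional to the sum of the
    rank-k leverage scores of their columns (an illustrative proxy).
    """
    _, _, Vt = np.linalg.svd(A, full_matrices=False)
    col_scores = np.sum(Vt[:k, :] ** 2, axis=0)
    blocks = np.array_split(np.arange(A.shape[1]), A.shape[1] // block_size)
    block_scores = np.array([col_scores[b].sum() for b in blocks])
    probs = block_scores / block_scores.sum()
    chosen = rng.choice(len(blocks), size=n_blocks, replace=False, p=probs)
    cols = np.concatenate([blocks[i] for i in np.sort(chosen)])
    return A[:, cols], cols

rng = np.random.default_rng(2)
A = rng.standard_normal((50, 6)) @ rng.standard_normal((6, 60))
C, cols = sample_column_blocks(A, block_size=10, n_blocks=3, k=6, rng=rng)
```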

    Approximations of Schatten Norms via Taylor Expansions

    In this paper we consider a symmetric, positive semidefinite (SPSD) matrix $A$ and present two algorithms for computing the $p$-Schatten norm $\|A\|_p$. The first algorithm works for any SPSD matrix $A$. The second algorithm works for non-singular SPSD matrices and runs in time that depends on $\kappa = \lambda_1(A)/\lambda_n(A)$, where $\lambda_i(A)$ is the $i$-th eigenvalue of $A$. Our methods are simple and easy to implement and can be extended to general matrices. Our algorithms improve, for a range of parameters, recent results of Musco, Netrapalli, Sidford, Ubaru and Woodruff (ITCS 2018) and match the running time of the methods by Han, Malioutov, Avron, and Shin (SISC 2017) while avoiding computation of the coefficients of Chebyshev polynomials.
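    For reference, the quantity being approximated can be computed exactly from the eigenvalues; below is a short numpy sketch of that brute-force baseline (not the paper's Taylor-expansion algorithms), with illustrative names.

```python
import numpy as np

def schatten_norm_spsd(A, p):
    """Exact p-Schatten norm of an SPSD matrix: (sum_i lambda_i^p)^(1/p).

    This is the O(n^3) eigendecomposition baseline that the faster
    approximation algorithms aim to avoid.
    """
    eigvals = np.linalg.eigvalsh(A)
    eigvals = np.clip(eigvals, 0.0, None)   # guard against tiny negative round-off
    return np.sum(eigvals ** p) ** (1.0 / p)

rng = np.random.default_rng(3)
B = rng.standard_normal((200, 200))
A = B @ B.T                                 # SPSD test matrix
print(schatten_norm_spsd(A, p=3))
```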

    Spectral Clustering: An Empirical Study of Approximation Algorithms and its Application to the Attrition Problem

    Clustering is the problem of separating a set of objects into groups (called clusters) so that objects within the same cluster are more similar to each other than to those in different clusters. Spectral clustering is a now well-known clustering method which uses the spectrum of the data similarity matrix to perform this separation. Since the method relies on solving an eigenvector problem, it is computationally expensive for large datasets. To overcome this constraint, approximation methods have been developed which aim to reduce running time while maintaining accurate classification. In this article, we summarize and experimentally evaluate several approximation methods for spectral clustering. From an applications standpoint, we employ spectral clustering to solve the so-called attrition problem, where one aims to separate the employees who are likely to voluntarily leave the company from those who are not. Our study sheds light on the empirical performance of existing approximate spectral clustering methods and shows their applicability to an important business-optimization problem.
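    For context, here is a compact numpy/scipy sketch of the textbook normalized spectral clustering pipeline that such approximation methods speed up; it is the baseline, not any of the approximation schemes evaluated in the article, and the kernel similarity used in the toy example is an assumption.

```python
import numpy as np
from scipy.cluster.vq import kmeans2

def spectral_clustering(S, k):
    """Baseline normalized spectral clustering on a similarity matrix S.

    Embed the points with the k smallest eigenvectors of the normalized
    Laplacian, row-normalize, then run k-means on the embedding.
    """
    d = S.sum(axis=1)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(d))
    L = np.eye(S.shape[0]) - D_inv_sqrt @ S @ D_inv_sqrt   # normalized Laplacian
    _, eigvecs = np.linalg.eigh(L)                          # eigenvalues in ascending order
    Y = eigvecs[:, :k]
    Y /= np.linalg.norm(Y, axis=1, keepdims=True)
    _, labels = kmeans2(Y, k, minit='++')
    return labels

# Toy usage: Gaussian-kernel similarity of two well-separated point clouds.
rng = np.random.default_rng(4)
X = np.vstack([rng.normal(0, 0.3, (30, 2)), rng.normal(3, 0.3, (30, 2))])
S = np.exp(-np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1))
labels = spectral_clustering(S, k=2)
```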

    Optimal Principal Component Analysis in Distributed and Streaming Models

    We study the Principal Component Analysis (PCA) problem in the distributed and streaming models of computation. Given a matrix $A \in R^{m \times n}$, a rank parameter $k < \mathrm{rank}(A)$, and an accuracy parameter $0 < \epsilon < 1$, we want to output an $m \times k$ orthonormal matrix $U$ for which $\|A - UU^T A\|_F^2 \le (1+\epsilon) \|A - A_k\|_F^2$, where $A_k \in R^{m \times n}$ is the best rank-$k$ approximation to $A$. This paper provides improved algorithms for distributed PCA and streaming PCA. Comment: STOC 2016 full version.
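    A brief numpy sketch of the objective above: for the exact top-k left singular vectors, the PCA cost $\|A - UU^T A\|_F^2$ equals the optimal $\|A - A_k\|_F^2$, which is the quantity the distributed and streaming algorithms approximate within a $(1+\epsilon)$ factor. The helper name is illustrative.

```python
import numpy as np

def pca_cost(A, U):
    """Frobenius PCA cost ||A - U U^T A||_F^2 for an orthonormal m x k matrix U."""
    return np.linalg.norm(A - U @ (U.T @ A), 'fro') ** 2

rng = np.random.default_rng(5)
A = rng.standard_normal((200, 120))
k = 10

# Exact solution: the top-k left singular vectors achieve the optimum,
# ||A - A_k||_F^2 = sum of the squared singular values beyond the k-th.
U, s, _ = np.linalg.svd(A, full_matrices=False)
optimum = np.sum(s[k:] ** 2)
assert np.isclose(pca_cost(A, U[:, :k]), optimum)
```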

    Do Parents Recognize Autistic Deviant Behavior Long before Diagnosis? Taking into Account Interaction Using Computational Methods

    BACKGROUND: To assess whether taking into account interaction synchrony would help to better differentiate autism (AD) from intellectual disability (ID) and typical development (TD) in family home movies of infants aged less than 18 months, we used computational methods. METHODOLOGY AND PRINCIPAL FINDINGS: First, we analyzed interactive sequences extracted from home movies of children with AD (N = 15), ID (N = 12), or TD (N = 15) through the Infant and Caregiver Behavior Scale (ICBS). Second, discrete behaviors between baby (BB) and caregiver (CG) co-occurring in less than 3 seconds were selected as single interactive patterns (or dyadic events) for analysis of the two directions of interaction (CG→BB and BB→CG) by group and semester. To do so, we used a Markov assumption, a Generalized Linear Mixed Model, and non-negative matrix factorization. Compared to TD children, BBs with AD exhibit a growing deviant development of interactive patterns, whereas those with ID rather show an initial developmental delay. Parents of children with AD or ID do not differ much from parents of TD children when responding to their child. However, when initiating interaction, parents use more touching and regulation-up behaviors as early as the first semester. CONCLUSION: When studying interactive patterns, deviant autistic behaviors appear before 18 months. Parents seem to feel the lack of interactive initiative and responsiveness of their babies and try to increasingly supply soliciting behaviors. Thus we stress that credence should be given to parents' intuition, as they recognize, long before diagnosis, the pathological process through the interactive patterns with their child.

    The projection score - an evaluation criterion for variable subset selection in PCA visualization

    Background: In many scientific domains, it is becoming increasingly common to collect high-dimensional data sets, often with an exploratory aim, to generate new and relevant hypotheses. The exploratory perspective often makes statistically guided visualization methods, such as Principal Component Analysis (PCA), the methods of choice. However, the clarity of the obtained visualizations, and thereby the potential to use them to formulate relevant hypotheses, may be confounded by the presence of the many non-informative variables. For microarray data, more easily interpretable visualizations are often obtained by filtering the variable set, for example by removing the variables with the smallest variances or by only including the variables most highly related to a specific response. The resulting visualization may depend heavily on the inclusion criterion, that is, effectively the number of retained variables. To our knowledge, there exists no objective method for determining the optimal inclusion criterion in the context of visualization.
    Results: We present the projection score, which is a straightforward, intuitively appealing measure of the informativeness of a variable subset with respect to PCA visualization. This measure can be universally applied to find suitable inclusion criteria for any type of variable filtering. We apply the presented measure to find optimal variable subsets for different filtering methods in both microarray data sets and synthetic data sets. We note also that the projection score can be applied in general contexts, to compare the informativeness of any variable subsets with respect to visualization by PCA.
    Conclusions: We conclude that the projection score provides an easily interpretable and universally applicable measure of the informativeness of a variable subset with respect to visualization by PCA, that can be used to systematically find the most interpretable PCA visualization in practical exploratory analysis.
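    The variance-based filtering mentioned above is easy to make concrete; the numpy sketch below keeps the highest-variance variables and computes the 2-D PCA coordinates one would visualize. It does not reproduce the projection score itself, whose definition is not given in this abstract; the function and parameter names are illustrative assumptions.

```python
import numpy as np

def variance_filter_pca(X, n_keep, n_components=2):
    """Keep the n_keep highest-variance variables of X (samples x variables),
    then return the PCA sample coordinates used for visualization.

    Illustrates the filtering step whose inclusion criterion (n_keep)
    the projection score is designed to choose.
    """
    Xc = X - X.mean(axis=0)
    keep = np.sort(np.argsort(Xc.var(axis=0))[::-1][:n_keep])
    Xf = Xc[:, keep]
    _, _, Vt = np.linalg.svd(Xf, full_matrices=False)
    coords = Xf @ Vt[:n_components].T       # scores on the leading PCs
    return coords, keep

rng = np.random.default_rng(6)
X = rng.standard_normal((40, 500))          # e.g. 40 samples, 500 variables
coords, kept = variance_filter_pca(X, n_keep=100)
```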

    Feature-by-Feature – Evaluating De Novo Sequence Assembly

    The whole-genome sequence assembly (WGSA) problem is among the most studied problems in computational biology. Despite the availability of a plethora of tools (i.e., assemblers), all claiming to have solved the WGSA problem, little has been done to systematically compare their accuracy and power. Traditional methods rely on standard metrics and read simulation: on the one hand, metrics like N50 and the number of contigs focus only on size without proportionately emphasizing information about the correctness of the assembly; on the other hand, comparisons performed on simulated datasets can be highly biased by the non-realistic assumptions in the underlying read generator. Recently the Feature Response Curve (FRC) method was proposed to assess overall assembly quality and correctness: FRC transparently captures the trade-off between contigs' quality and their sizes. Nevertheless, the relationships among the different features and their relative importance remain unknown. In particular, FRC cannot account for the correlation among the different features. We analyzed the correlation among different features in order to better describe their relationships and their importance in gauging assembly quality and correctness. In particular, using multivariate techniques like principal and independent component analysis, we were able to estimate the “excess dimensionality” of the feature space. Moreover, principal component analysis allowed us to show how poorly the acclaimed N50 metric describes assembly quality. Applying independent component analysis, we identified a subset of features that better describes the assemblers' performance. We demonstrated that by focusing on a reduced set of highly informative features, we can use the FRC curve to better describe and compare the performance of different assemblers. Moreover, as a by-product of our analysis, we discovered how often evaluation based on simulated data, obtained with state-of-the-art simulators, leads to unrealistic results.
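    A short numpy sketch of the kind of analysis described: run PCA on a standardized (assemblies x features) matrix and inspect the explained-variance ratios to gauge how many effective dimensions the feature set really has. The matrix shape and the toy data are assumptions; this is not the paper's exact pipeline.

```python
import numpy as np

def explained_variance_ratio(F):
    """Explained-variance ratios of the principal components of an
    (assemblies x features) matrix F, after standardizing each feature.

    A steep drop-off suggests the feature space has few effective
    dimensions, i.e. the remaining ones are "excess dimensionality".
    """
    Z = (F - F.mean(axis=0)) / F.std(axis=0)
    _, s, _ = np.linalg.svd(Z, full_matrices=False)
    var = s ** 2
    return var / var.sum()

rng = np.random.default_rng(7)
F = rng.standard_normal((30, 4)) @ rng.standard_normal((4, 12))  # 12 correlated features
print(explained_variance_ratio(F))
```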